Iterative random forests to discover predictive and stable high-order interactions

Authors

  • Sumanta Basu
  • Karl Kumbier
  • James B. Brown
  • Bin Yu
Abstract

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on random forests (RFs) and random intersection trees (RITs) and through extensive, biologically inspired simulations, we developed the iterative random forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as the RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, third-order interactions, e.g., between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF rediscovered a central role of H3K36me3 in chromatin-mediated splicing regulation and identified interesting fifth- and sixth-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens additional avenues of inquiry into the molecular mechanisms underlying genome biology.
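The abstract describes a stable interaction as one returned in more than half of bootstrap replicates. The following is a minimal sketch of that stability filter only, not of iRF itself. It assumes the set of interactions recovered in each bootstrap replicate has already been produced by some upstream procedure; the function names and the toy feature labels A, B, and C are hypothetical and are not taken from the authors' code or data.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List, Set

Interaction = FrozenSet[str]


def stability_scores(replicates: Iterable[Iterable[Interaction]]) -> Dict[Interaction, float]:
    """Return, for each interaction, the fraction of replicates that recovered it."""
    # Convert each replicate to a set so an interaction counts at most once per replicate.
    per_replicate: List[Set[Interaction]] = [set(r) for r in replicates]
    counts: Counter = Counter()
    for found in per_replicate:
        counts.update(found)
    n = len(per_replicate)
    return {interaction: c / n for interaction, c in counts.items()}


def stable_interactions(
    replicates: Iterable[Iterable[Interaction]], threshold: float = 0.5
) -> Dict[Interaction, float]:
    """Keep interactions recovered in strictly more than `threshold` of replicates."""
    scores = stability_scores(replicates)
    return {i: s for i, s in scores.items() if s > threshold}


# Hypothetical toy input: four bootstrap replicates over made-up features A, B, C.
toy_replicates = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"}), frozenset({"B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
]

stable = stable_interactions(toy_replicates)
```

In this toy input the pair {A, B} appears in all four replicates and passes the filter, while the third-order set {A, B, C} appears in exactly half of them and is excluded because the comparison is strict.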

Similar resources

Iterative Random Forests to detect predictive and stable high-order interactions

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genomewide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that operate in vivo as components of larger molecular machines that regulate gene expression. Understanding these processes and the high-order interactions that govern them present...

Random forests algorithm in podiform chromite prospectivity mapping in Dolatabad area, SE Iran

The Dolatabad area located in SE Iran is a well-endowed terrain owning several chromite mineralized zones. These chromite ore bodies are all hosted in a colored mélange complex zone comprising harzburgite, dunite, and pyroxenite. These deposits are irregular in shape, and are distributed as small lenses along colored mélange zones. The area has a great potential for discovering further chromite...

Comparison of Stability Parameters for Detection of Stable and High Essential Oil Yielding Landraces of Rosa damascena Mill.

The essential oil yield stability of damask rose (Rosa damascena Mill.) as an important medicinal and aromatic plant in different environments has not been well documented. In order to determine appropriate stability parameters, six statistics were studied for essential oil stability of 35 Rosa damascena landraces in seven locations (Sanandaj, Arak, Kashan, Dezful, Stahban, Ke...

Spatial pattern analysis and interactions of Persian oak and wild pistachio (Baneh) in the Ghalajeh forests of Kermanshah using the K2 function

Quercus brantii Lindl. and Pistacia atlantica Desf. are the most important tree species in the Zagros forests. The abundant use of these trees by the inhabitants of the area has led to a reduction in the quality and quantity of these valuable species, as well as the creation of heterogeneous masses. Recognizing the spatial pattern and the interactions of trees can be a key to managerial interve...

Stability of variable importance scores and rankings using statistical learning tools on single-nucleotide polymorphisms and risk factors involved in gene × gene and gene × environment interactions

Risk of complex disorders is thought to be multifactorial, involving interactions between risk factors. However, many genetic studies assess association between disease status and markers one single-nucleotide polymorphism (SNP) at a time, due to the high-dimensional nature of the search space of all possible interactions. Three ensemble methods have been recently proposed for use in high-dimen...


Journal title:

Volume 115, Issue

Pages -

Publication date: 2018